Add TinyOpenFold: GPU Optimization Tutorial with AlphaFold 2 Evoformer #164

Open
asitav wants to merge 41 commits into amd:main from asitav:tiny_openfold

asitav (Contributor) commented May 29, 2026

Summary

This PR adds TinyOpenFold, an educational example that demonstrates GPU optimization techniques on AMD GPUs. The tutorial takes an AlphaFold 2 Evoformer implementation through three stages, from a baseline PyTorch version to custom Triton kernels.

Key Features

  • Three Progressive Optimization Stages:

    • V1 (Baseline): Clean PyTorch implementation
    • V2 (Kernel Fusion): PyTorch-level optimizations with kernel fusion
    • V3 (Custom Triton Kernels): Hand-optimized GPU kernels using Triton
  • Comprehensive Profiling Integration:

    • PyTorch Profiler for high-level bottleneck identification
    • rocprof-sys for system-level GPU traces and kernel timelines
    • rocprofv3 for detailed kernel metrics and launch counts
    • rocprof-compute for hardware counter analysis and memory bandwidth
  • Complete Educational Pipeline:

    • Step-by-step optimization tutorial with detailed explanations
    • Ablation studies showing individual optimization contributions
    • Performance analysis with bottleneck decomposition
    • ROCm profiling tool integration at each stage
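
For readers new to the V2 idea, the following is a conceptual sketch of what fusing a chain of elementwise operations means. It is a plain-Python stand-in, not the code in `tiny_openfold_v2.py`: the function names and the bias-plus-sigmoid-gate pattern are assumptions chosen for illustration. On a GPU each "pass" below would roughly correspond to a separate kernel launch with its intermediate written to memory.

```python
import math


def gate_unfused(x, bias, value):
    """Three separate passes; each one materializes a full intermediate list."""
    shifted = [xi + bi for xi, bi in zip(x, bias)]        # pass 1: bias add
    gate = [1.0 / (1.0 + math.exp(-s)) for s in shifted]  # pass 2: sigmoid
    return [g * v for g, v in zip(gate, value)]           # pass 3: gating multiply


def gate_fused(x, bias, value):
    """One pass; per-element intermediates stay in local variables."""
    out = []
    for xi, bi, vi in zip(x, bias, value):
        s = xi + bi                      # bias add
        g = 1.0 / (1.0 + math.exp(-s))   # sigmoid
        out.append(g * vi)               # gating multiply
    return out


# Hypothetical inputs, chosen only to exercise both code paths.
x = [0.5, -1.0, 2.0, 0.0]
bias = [0.1, 0.1, 0.1, 0.1]
value = [1.0, 2.0, 3.0, 4.0]

unfused_result = gate_unfused(x, bias, value)
fused_result = gate_fused(x, bias, value)
print("results match:", fused_result == unfused_result)
```

Both functions perform the same arithmetic in the same order per element; the fused form simply avoids the two intermediate buffers, which is the memory-traffic saving that fusion targets.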

What's Included

MLExamples/TinyOpenFold/
├── README.md                              # Main documentation
├── ARCHITECTURE.md                        # Evoformer architecture details
├── PERFORMANCE_OPTIMIZATION_TUTORIAL.md   # Complete optimization guide
├── optimization_tutorial.sh               # Automated tutorial script
├── version1_pytorch_baseline/            # Baseline PyTorch (V1)
│   ├── tiny_openfold_v1.py
│   ├── run_deepspeed_flops.sh           # FLOPs analysis
│   └── FLOPS_ANALYSIS.md                # FLOPs profiling guide
├── version2_pytorch_fused/               # Kernel fusion (V2)
│   ├── tiny_openfold_v2.py
│   ├── run_rocprofv3.sh                 # Kernel profiling
│   ├── run_rocprof_sys.sh               # System profiling
│   └── run_rocprof_compute.sh           # Hardware counters
└── version3_triton/                      # Custom kernels (V3)
    ├── tiny_openfold_v3.py
    ├── launch_performance_study.sh      # V1/V2/V3 comparison
    └── profiling scripts

Problem Sizes

The tutorial demonstrates optimization across different problem sizes:

  • Small: 64 residues, 16 MSA sequences, batch size 4
  • Medium: 128 residues, 32 MSA sequences, batch size 2

Educational Value

This example teaches:

  • Systematic GPU optimization methodology
  • Profiling techniques (PyTorch Profiler + ROCm tools)
  • Kernel fusion strategies with PyTorch
  • Custom GPU kernel development with Triton
  • Performance analysis and bottleneck identification
  • Memory vs. speed trade-offs
  • AlphaFold 2 Evoformer architecture

Testing

  • Tested on AMD Instinct MI300X with ROCm 7.2
  • All three versions produce numerically identical outputs
  • Includes validation mode (--validate-setup)
  • Multi-GPU scaling tested (1, 2, 4, 8 GPUs)
  • Comprehensive profiling integration verified
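
The PR does not show how output equivalence between versions was checked. One common approach, sketched below under that caveat, is to compare a candidate's outputs against the V1 reference elementwise with a tolerance, since fused or hand-written kernels may reorder floating-point operations. The helper name, the tolerances, and the sample values are all hypothetical.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return True if two flat sequences of floats agree within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


# Hypothetical flattened outputs standing in for V1, V2, and V3 results.
v1_out = [0.125, -1.5, 3.0]
v2_out = [0.125, -1.5, 3.0]
v3_out = [0.125, -1.5000001, 3.0]

print("V2 matches V1:", outputs_match(v1_out, v2_out))
print("V3 matches V1:", outputs_match(v1_out, v3_out))
print("V3 bitwise equal to V1:", v1_out == v3_out)
```

Reporting both the tolerance-based result and the exact comparison makes it clear which sense of "identical" a given validation run established.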

Documentation

  • Complete optimization tutorial (PERFORMANCE_OPTIMIZATION_TUTORIAL.md) with step-by-step guide
  • Architecture documentation (ARCHITECTURE.md)
  • Version-specific READMEs for each implementation
  • Automated tutorial script (optimization_tutorial.sh)
  • ROCm profiling tool integration guide

Target Audience

  • ML engineers learning GPU optimization
  • Researchers working with protein structure prediction
  • Students studying AlphaFold 2 architecture
  • Developers optimizing deep learning workloads on AMD GPUs

Related

This example complements existing HPC Training Examples by providing:

  • Real-world ML optimization case study
  • ROCm profiling tool usage examples
  • Triton kernel development tutorial
  • Multi-stage optimization methodology

Ready to merge: All code tested, documentation complete, and profiling integration verified.

asitav and others added 30 commits November 5, 2025 18:14
…hon for Python call stack profiling. Updated default parameters for batch size and sequence length to optimize output size. Enhanced README with detailed usage instructions and output file descriptions.
asitav and others added 11 commits January 13, 2026 15:42
- Replace manual wheel downloads with pip install from nightly repository
- Update requirements.txt with new PyTorch versions
- Simplify installation process
- Update run_rocprof_sys.sh file.
…ling

- Add automatic detection of rocpd package availability
- Conditionally enable ROCPROFSYS_USE_ROCPD only if rocpd is found
- Set ROCPROFSYS_CONFIG_FILE only if ~/.rocprof-sys.cfg exists
- Add --trace flag to rocprof-sys-python command
- Update help text with accurate configuration information
- Updated venv names from venvOF/venvOFr711 to simple venv
- Changed ROCm module from 7.1.1 to 7.2 (PyTorch still uses ROCm 7.1 nightly)
- V2 README now references main README for environment setup to avoid duplication
- Updated requirements.txt with current dependencies
- Add pip install command for requirements_rocprof-compute-develop.txt in README.md
- Include new requirements file for rocprof-compute development dependencies
- Source: https://github.com/ROCm/rocm-systems/blob/develop/projects/rocprofiler-compute/requirements.txt
Includes comprehensive tutorial docs and automated test script for demonstrating
progressive optimization from baseline PyTorch to custom Triton kernels.
- Optimized FLOPS_ANALYSIS.md for conciseness
- Removed redundant files and scaling scripts
- Removed exercises directories from v2 and v3
- Updated documentation references